library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(readxl)
library(tidyr)
library(plotly)
library(stringr)
library(tidyselect)

1 General Instructions

Your take-home final consists of 3 parts. The first part consists of some simple questions and their answers. These questions might call for coding, brief comments or direct answers. The second part is about your group projects. You are asked to contribute an additional analysis with two or three visualizations to your project report. The third part is about gathering real-life data and conducting an analysis on it. Here are significant points that you should read carefully.

2 Part I: Short and Simple (20 pts)

The purpose of this part is to gauge your understanding of data manipulation, visualization and the data science workflow in general. Most questions have no single correct answer; some don’t have good answers at all. It is possible to write many pages on the questions below, but please keep it short. Constrain your answers to one or two paragraphs (7-8 lines tops).

  1. What is your opinion about the AI hype? Do you think that, in 10 years, AI will solve many critical problems? What about the human capital required to build all the AI? If we consider a sufficient level to be 100, what would be our level as Turkey / World? Why do you think so? (Give a number for both Turkey and the World and defend your scoring.)

  2. What is your exploratory data analysis workflow? Suppose you are given a data set and a research question. Where do you start? How do you proceed? For instance, you are given the task to distribute funds from donations to public welfare projects in a wide range of subjects (e.g. education, gender equality, poverty, job creation, healthcare) with the objective of maximum positive impact on the society in general. Assume you have almost all the data you require. How do you measure impact? How do you form performance measures? What makes you think you find an interesting angle?

  3. If you had to plot a single graph using the diamonds data what would it be? Why? Make your argument, actually code the plot and provide the output. (You can find detailed info about the diamonds data set in its help file. Use ?diamonds, after you load the ggplot2 package.)

3 Part II: Extending Your Group Project (30 pts)

In this part you are going to extend your group project with a single additional analysis supported by some visualization. You are tasked with finding the best improvement on top of your group project. About one page is enough, two pages tops.

4 Part III: Welcome to Real Life (50 pts)

As all of you know well enough, real life data is not readily available and it is messy. Also, you will face situations where you need to discover and learn another framework. In this part, you are going to gather data about organic agriculture production from the Ministry of Agriculture and Forestry. You should use the Organik Tarımsal Üretim Verileri (organic agricultural production data) for 2014-2018 from https://www.tarimorman.gov.tr/Konular/Bitkisel-Uretim/Organik-Tarim/Istatistikler. Take some time to see what is offered in the data sets. Choose an interesting theme which can be analyzed with the given data and collect relevant data from the service. Some example themes can be as follows.

  1. Gather the data, bind them together and save it in an .RData file. You can make .RData file available online for everybody. Provide the data link in your analysis. You can work together with your friends to provide one comprehensive .RData file if it is more convenient to you. (You don’t need to report any code in this part.)

Tip: You can use readxl package for xlsx and xls files.

  2. Perform EDA on the data you collected based on the theme you decided on. Keep it short. One page is enough, two pages tops. Original and interesting work is important (data sharing is good, but be careful about idea sharing). If you are interested and want to keep going, write a data blog post about it. I will not grade it but I can share it on social media.

5 ANSWERS

5.1 Part - 1

1. I think that AI is a buzzword of the 21st century. Although it has a lot of potential to solve problems, it will probably give rise to new ones. Since AI is based on data analysis, reluctance to share personal data is an important brake on AI; much depends on how far people and institutions are willing to share their data. AI also needs to make fair decisions. Because the field is very easy to manipulate, it can produce distorted results in the hands of the wrong people. Nevertheless, I think it will play an important role in solving many human problems.

To build artificial intelligence, it may be necessary to encourage people with different domain knowledge to learn what they can do with data analysis. This will support AI-based thinking. In addition, data collection on processes should be systematic and orderly.

Comparing Turkey with the world, we are behind the world average in our outlook on artificial intelligence and technology. If I score the world’s current level as 50, Turkey would be at about 7-8.

2. First of all, I try to understand the columns of the data. To do this, I look at descriptive statistics of the columns, such as their mean, median and distribution. The summary() function is useful for this purpose.

  • Column names and types can be converted to a form that is easier to understand.
  • If there are missing values, they can be filled appropriately. For example, if a column is roughly normally distributed, missing values can be filled with the mean. If the column does not fit any suitable distribution, filling with the mode can be reasonable.
  • I try to choose the columns I can use in my analysis.
  • I look at random samples of the data.
  • If the analysis has an objective, I base my analysis on it. If it does not, or the objective is general, I try to give more neutral results.
  • To answer your questions, I would probably choose “Pain Points in Our Society and Optimal Budget Allocation” as my title. I think the title of an analysis is as important as the analysis itself, because a title gives readers an emotion about the article and they read the analysis from that point of view. So the title should be unbiased.
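The first steps of the workflow above (descriptive statistics, then filling missing values with the mean or the mode) can be sketched as follows. The data frame df, its columns and the helper most_frequent() are hypothetical and only illustrate the idea.

```r
# Hypothetical toy data with missing values (not from any real data set).
df <- data.frame(
  income = c(10, 12, NA, 11, 13),
  sector = c("education", "health", "education", NA, "education"),
  stringsAsFactors = FALSE
)

# Descriptive statistics and the number of missing values per column.
print(summary(df))
print(colSums(is.na(df)))

# Numeric column assumed to be roughly symmetric: fill with the mean.
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

# Helper (hypothetical name): most frequent non-missing value of a vector.
most_frequent <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Categorical column: fill with the mode.
df$sector[is.na(df$sector)] <- most_frequent(df$sector)
```

Whether the mean or the mode is appropriate depends on the column; the choice here is only an assumption for the toy data.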

3. I want to analyze whether the carat of a diamond and its price are related. The easiest way to see the relationship between two variables is to examine a scatter plot. I used the log() function to make the graph easier to read. Additionally, I wondered what effect the color of the diamond has on the price.
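A plot along these lines could be coded as in the sketch below. It assumes that log() is applied to both carat and price and that the color variable is mapped to point colour; the object names p and log_cor are illustrative.

```r
library(ggplot2)

# Scatter plot of price against carat on the log scale, coloured by diamond color.
p <- ggplot(diamonds, aes(x = log(carat), y = log(price), colour = color)) +
  geom_point(alpha = 0.3, size = 0.8) +
  labs(
    x = "log(carat)",
    y = "log(price)",
    colour = "Color",
    title = "Diamond price vs. carat by color"
  )

# Pearson correlation between the two log-transformed variables.
log_cor <- cor(log(diamonds$carat), log(diamonds$price))

print(p)
print(log_cor)
```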

Results :

As can be seen above, carat plays a major role in a diamond’s price. Also, for a given carat, diamonds with color D are more expensive than diamonds with color J.

5.2 Part - 2

In our term project, we examined how interest rates affect economic indicators. Here I ask whether technology exports are related to the consumer price index.

Figure: research and development expenditure vs. consumer price index (researchers in R&D)

According to the graph above, both research and development expenditure (% of GDP) and the consumer price index have been increasing year by year. Unfortunately, there is not enough evidence that
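When two series both rise over time, their raw correlation can be high even if they are unrelated. One common check is to correlate year-over-year changes instead of levels. The sketch below uses made-up numbers, not the project data; rd_exp and cpi are hypothetical vectors.

```r
# Hypothetical yearly values (NOT the project data), both trending upward.
years  <- 2010:2018
rd_exp <- c(0.80, 0.82, 0.85, 0.84, 0.88, 0.90, 0.95, 0.96, 1.00)  # % of GDP
cpi    <- c(100, 106, 116, 124, 135, 146, 157, 175, 203)           # index

# Correlation of the levels: inflated by the shared upward trend.
level_cor <- cor(rd_exp, cpi)

# Correlation of year-over-year changes: removes the common trend.
change_cor <- cor(diff(rd_exp), diff(cpi))

print(c(levels = level_cor, changes = change_cor))
```

A large gap between the two numbers suggests that the shared trend, rather than a direct relationship, drives the correlation of the levels.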

5.3 Part - 3

a. Below you can find information about my dataset of organic agriculture statistics.

5.3.0.1 Description

finalExam_dataset contains organic agriculture statistics: the number of farmers, gross production and production area, by year and city. You can download the data from the link.

5.3.0.2 Details

The .RData file contains two data frames: one with agriculture statistics by product and one with total production information.

Organic agriculture data frame with 17597 rows and 4 columns:

  • iller: Production city
  • Urun_adi: Organic product name
  • Uretim_miktari: Gross production of the product in the given year
  • yıl: Production year

Total organic agriculture data frame with 527 rows and 8 columns:

  • iller: Production city
  • Gercek_ciftci_sayisi: Number of farmers
  • Gercek_uretim_alani: Actual production area
  • Dogal_toplama_alani: Natural gathering area
  • Nadas_alani: Fallow area
  • Toplam_alan: Total production area
  • Uretim_miktari: Production amount in the given year
  • yıl: Production year
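A data set of this shape could be assembled roughly as in the sketch below: one table per year, bound together with a year column and saved to an .RData file. The yearly tables here are tiny made-up stand-ins; in practice each would come from readxl::read_excel() on a downloaded file. The column yil is an ASCII stand-in for yıl, and the object names and file name are hypothetical.

```r
library(dplyr)

# Hypothetical stand-ins for the yearly sheets (values are made up).
y2014 <- data.frame(iller = c("Kilis", "Izmir"), Uretim_miktari = c(10, 20))
y2015 <- data.frame(iller = c("Kilis", "Izmir"), Uretim_miktari = c(12, 25))

# Name each table by its year so bind_rows() can record it in a column.
yearly <- list(`2014` = y2014, `2015` = y2015)
combined <- bind_rows(yearly, .id = "yil")
combined$yil <- as.integer(combined$yil)

# Save the combined data frame to an .RData file in a temporary folder.
rdata_path <- file.path(tempdir(), "organic_example.RData")
save(combined, file = rdata_path)

# load() restores the object under its original name.
restored <- new.env()
load(rdata_path, envir = restored)
print(restored$combined)
```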

b. Project aim: Find the top 5 cities by organic production, the change in their production amounts, the proportion gathered from nature, and the change in the production amounts of popular products.

  • Kilis is the city with the highest unit production amount in Turkey. Fallow area is excluded from this calculation.

  • According to the graph above, the amount of production in the cities has increased year by year. Although its amount of production is small compared to other cities, Kilis has the highest unit production, which indicates that all of its production is organic.
  • According to the graph above, organic farming is mostly done by growing crops. Collecting from nature has only a small share in organic production.
  • Production of most of the popular organic products increased from year to year, but apricot experienced a serious decline in 2014.
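The city-level measures discussed above can be sketched as follows. The sketch assumes that unit production means production divided by total area minus fallow area, and that the natural gathering share is gathering area over total area; the exact formulas used for the graphs are not shown in this report. The rows in totals are made up and only reuse the column names of the total data frame.

```r
library(dplyr)

# Hypothetical rows (values are made up) with the total data frame's columns.
totals <- data.frame(
  iller               = c("Kilis", "Kilis", "Izmir", "Izmir", "Aydin", "Aydin"),
  yil                 = c(2017, 2018, 2017, 2018, 2017, 2018),
  Toplam_alan         = c(100, 110, 5000, 5200, 4000, 4100),
  Nadas_alani         = c(10, 10, 500, 400, 300, 350),
  Dogal_toplama_alani = c(0, 0, 200, 250, 100, 120),
  Uretim_miktari      = c(900, 1000, 9000, 9600, 7000, 7400)
)

city_summary <- totals %>%
  group_by(iller) %>%
  summarise(
    total_production = sum(Uretim_miktari),
    # Assumed definition: production per unit of area, fallow area excluded.
    unit_production  = sum(Uretim_miktari) / sum(Toplam_alan - Nadas_alani),
    # Assumed definition: share of the total area used for natural gathering.
    natural_share    = sum(Dogal_toplama_alani) / sum(Toplam_alan),
    .groups = "drop"
  )

# Top cities by total production (at most 5) and the city with the highest
# unit production.
top_cities <- city_summary %>% arrange(desc(total_production)) %>% head(5)
top_unit   <- city_summary %>% arrange(desc(unit_production)) %>% head(1)

print(top_cities)
print(top_unit)
```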